Spark Transformation Actions

Immutable
  • Creating data frame Data Frame is immutable data structure.

We can’t Change them so how we are going to process them

but we can give instruction to driver what to do like 😀

  • filter(age<40)
  • groupBy(county.count())

Now driver will decide how to achieve it by instruction.

This instruction are called Transformer

Transformer -- same sa SQL like operation


import sys
import os 

from pyspark import SparkConf
from pyspark.sql import SparkSession
from lib.logger import Log4J
from lib.utils import get_spark_app_config

if __name__ == "__main__":
	conf = get_spark_app_config()
	spark = SparkSession.builder \
			.config(conf=conf) \
			.getOrCreate()
	logger = Log4J(spark)
	if len(sys.argy) != 2:
		logger.error("Usage: HelloSpark <filename>")
		sys.exit(-1)

	logger.info("Starting Spark Session")

	conf_out = spark.SparkContext.getConf()
	logger.info("Finished HelloSpark") 

	# ---- Read The Dataframe ----
	spark_df = spark.read \
				.option('header','true') \
				.csv(sys.argv[1])
	spark.stop()

We need to modify transformation by using spark_df

spark_df.where("Age <40") \
	.select("Age", "Gender", "Country","state") \
	.groupBy("Country")

We can do like this or break it to variable also
Here we are creating a graph of operation.

Pasted image 20240618144915.png

This all are going to be create as of activities
DAG

These are 2 type of operation :
- Transformation
- Narrow Dependency Transformation
- Wide Dependency Transformation
- Action


Where
Pasted Image 20240618151714_691.png
Group By
Pasted image 20240618152039.png

How to Fix this Group by as result isn't perfect ?
we can do this by Combine & Repartition

Pasted image 20240618152453.png

Combine all partition 

Pasted image 20240618155701.png
TransformAction

Pasted image 20240618155733.png

*

Lazy Evalution


functional programming technique -
Pasted image 20240618170427.png
but Spark Programming -- will using builder pattern to create a - DAG of transformations

Pasted image 20240618172352.png

What is Action ?

Transforming one df to another df as expecting result. here `show()` is the action

All below are is Action 😀
- Read
- Wrote
- Collect
- Show

Spark Action will terminate the transformation dag and trigger the execution.

Pasted image 20240618190042.png